Cell Genomics
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match Cell Genomics's content profile, based on 162 papers previously published here. The average preprint has a 0.23% match score for this journal, so anything above that is already an above-average fit.
Dinh, B. L.; Wang, X.; Sheng, X.; Wan, P.; Srivastava, A. K.; Naseri, T.; Viali, S.; Wilkens, L.; Le Marchand, L.; Haiman, C. A.; Weeks, D.; Chiang, C. W. K.; Carlson, J. C.
Although genome-wide association studies (GWAS) now routinely reveal genetic associations and biological insights in millions of individuals, underrepresentation of global populations, such as those from Polynesia, persists. These exclusions, often driven by logistical challenges and lack of data, prevent systematic identification of population-enriched associations, such as the association of the CREBRF missense variant with BMI and type 2 diabetes, which was discovered in Polynesian populations, where the variant is common, because it is rare in other global populations. Armed with the recently updated TOPMed imputation panel, which could benefit studies in diverse populations that previously had poorer imputation performance, we performed the first GWAS of Native Hawaiians and the largest to date of Polynesian-ancestry populations (combined N up to 8,461) to identify population-enriched associations for 13 adiposity and cardiometabolic traits available across both cohorts: BMI, fasting glucose, fasting insulin, HDL, height, hip circumference, HOMA-IR, LDL, T2D, total cholesterol, triglycerides, waist circumference, and waist-hip ratio. We found 25 trait-locus associations that met genome-wide significance: 20 previously reported or known associations and 5 associations newly confirmed via meta-analysis. In particular, with improved statistical power, we were able to confirm the suspected association of the missense CREBRF variant with fasting glucose levels. The remaining 4 potentially novel locus-trait associations for BMI, LDL, and waist-hip ratio, however, were not replicated in multi-ethnic datasets from All of Us despite having reasonable power to replicate.
The lack of Polynesian-enriched findings outside of the CREBRF locus informs the bounds of the effect sizes or frequency of any enriched variants, and suggests that further expansion of cohort sizes from this region of the world and improved imputation references specific to these populations are needed to identify more population-enriched associations.
Yang, Y.; Quintana-Urzainqui, I.; Pratt, T.
The 574 kilobase pair 16p11.2 microdeletion raises a person's odds for neurodevelopmental and energy balance conditions, particularly autism and obesity. There is considerable clinical heterogeneity, and how much this reflects genetic versus environmental or stochastic factors is unclear. Forebrain interneurons originate from progenitors residing in the ventricular zone of the foetal ventral telencephalon, and their perturbation is implicated in a number of 16p11.2 phenotypes, prompting investigation of how the 16p11.2 microdeletion impacts their development. We differentiate human induced pluripotent stem cells (iPSCs), isogenic except for heterozygous 16p11.2 microdeletion to minimise confounding effects of genetic background, to ventral telencephalic interneuron progenitor fate in 2D culture and use single cell RNA sequencing to obtain single cell transcriptome populations for comparative bioinformatics. Hundreds of transcripts are differentially expressed and many associate with cell signalling, chromatin, neurodevelopmental conditions including autism, and obesity. Pertinently, we find that transcript level variation is significantly greater in 16p11.2 heterozygous progenitors than in their isogenic wild type counterparts. This holds for sets of genes comprising regulons (gene-sets functionally connected by transcription factor regulation) and for randomly selected gene-sets, indicating that the 16p11.2 locus itself has a genome-wide role in stabilising transcription between cells. Regulons with the greatest increase in variation in 16p11.2 heterozygous progenitors exhibit strong enrichment for cell cycle related genes, resonating with our earlier finding of increased cell cycle variability between 16p11.2 heterozygous organoids, and many are regulated by transcription factors associated with autism and/or obesity, reinforcing the idea that unusual transcriptional variation itself contributes to phenotypes.
Motegi, T.; Huang, F.; Campbell, J. D.
Local ancestry inference (LAI) enables high-resolution characterization of chromosomal segments inherited from distinct ancestral populations, offering unique insights into genetic architecture in admixed cohorts. While LAI is commonly performed with high-coverage whole-genome sequencing (WGS), its performance with other genotyping assays or at varying sequencing depths has not been thoroughly benchmarked. In this study, we systematically evaluated the accuracy of LAI across SNP microarrays, whole-exome sequencing (WES), and ultra low-pass WGS (ULP-WGS) using diverse validation samples and state-of-the-art imputation pipelines. We show that ULP-WGS, when paired with GLIMPSE2, achieves robust accuracy at 0.25x coverage with a minimum genome window size of 0.5 centimorgans, with mean accuracy minus one standard deviation exceeding 95%. For WES, using "on-target" reads alone yields suboptimal performance, particularly for European and South Asian ancestries, with accuracy less than 79.1% and 70.6%, respectively. However, incorporating "off-target" reads in WES and utilizing GLIMPSE2 substantially improved accuracy to ≥95% with a minimum window size of 0.2 centimorgans. We further evaluated formalin-fixed, paraffin-embedded (FFPE) samples and found that LAI could be performed successfully using WES data with accuracies of ≥95% at a minimum window size of 0.5 centimorgans. In contrast, SNP microarrays did not achieve substantial accuracies at any window size (≤95%). Together, these results demonstrate that LAI is achievable without conventional high-coverage WGS and establish optimal parameters for LAI across platforms.
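The robustness criterion quoted in this abstract (mean accuracy minus one standard deviation exceeding 95%) can be illustrated with a minimal sketch. The function name, the use of the sample standard deviation, and the accuracy values are all assumptions made for illustration; the abstract does not state how the standard deviation was computed, and none of the numbers below come from the study.

```python
from statistics import mean, stdev


def passes_accuracy_criterion(accuracies, threshold=0.95):
    """Return True when (mean - one standard deviation) exceeds the threshold.

    Uses the sample standard deviation (n - 1 denominator); this choice is an
    assumption, not something stated in the abstract.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two samples to estimate a standard deviation")
    lower_bound = mean(accuracies) - stdev(accuracies)
    return lower_bound > threshold


# Hypothetical per-sample LAI accuracies (made up for illustration).
tight = [0.97, 0.98, 0.975, 0.985]  # high mean, low spread
noisy = [0.99, 0.90, 0.99, 0.98]    # mean above 0.95, but one poor sample

print(passes_accuracy_criterion(tight))  # criterion met
print(passes_accuracy_criterion(noisy))  # criterion not met despite mean > 0.95
```

The second example shows why such a criterion is stricter than a threshold on the mean alone: a single poorly inferred sample widens the spread enough to fail it.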
Dashtiahangar, M.; Siggers, T.
Most autoimmune disease-associated variants lie in non-coding regions, but the molecular mechanisms linking these variants to gene regulation remain poorly understood. A major unresolved challenge is to determine how disease alleles alter transcription factor (TF) binding, cofactor (COF) recruitment, and enhancer activity at scale. Here, we used the CASCADE method to profile differential binding of five TFs and ten COFs to 2,901 autoimmune disease-associated variants in Jurkat T cells, identifying 516 binding-modulating variants. Variants impacting binding were enriched among MPRA-defined expression-modulating variants and were strongly concordant with allele-specific reporter expression, linking altered TF/COF recruitment to enhancer activity. A majority of variants perturb binding of five major TF families -- ETS, RUNX, SP/KLF, OVOL/MYBL, and bHLH -- all of which have established roles in T cell biology. Notably, we find that ETS and RUNX factor binding is enriched at different variant functional classes, suggesting that they act through distinct regulatory mechanisms at disease loci. We describe allele-dependent regulator "switching" at several loci, where distinct complexes are found at reference and variant alleles, and we identify a recurrent regulatory module involving FOXM1 and the cofactors TIP60, BRD4, NCOA3, and NCOA1 assembling on ETS sites that tracks with gene expression. Together, this integrated biochemical and functional framework prioritizes autoimmune disease-associated variants by linking allele-specific TF/COF binding mechanisms to enhancer activity.
Ichikawa, Y.
Cross-population reversal of signed linkage disequilibrium (LD), or the "flip-flop" phenomenon, can arise when a tag SNP captures different extended haplotype backgrounds across populations. The MICA hepatocellular carcinoma susceptibility variant rs2596542 exemplifies this problem in the MHC, where signed LD reverses between Japanese and European populations but the relevant regulatory backgrounds are obscured by haplotypic complexity. We analyzed 7,303 biallelic SNVs surrounding rs2596542 across 26 populations using carrier-set topology classification followed by non-negative matrix factorization of carrier haplotypes. This identified two regulatory axes. Axis I, represented by components c4/c6, was population-stable and MICA-regulatory, with coherent MICA cis-eQTL enrichment and depletion for signed-LD reversal. Axis II, represented by component c5, was enriched for signed-LD reversal and showed an HLA-B↑/HLA-C↓ expression signature with no MICA overlap across six GTEx tissues. In an independent Japanese HCC cohort (LIRI-JP, n = 122), Axis II-associated HLA-C downregulation remained after adjustment for clinical covariates, immune infiltration, and HLA-A expression. The previously proposed cross-population tag rs2244546 mapped to a population-stable component rather than Axis II. A parallel reanalysis of the COMT Val158Met flip-flop locus reproduced the signed-LD pattern reported by Lin et al.¹ and showed population-specific latent backgrounds among Val carriers. These results show that carrier-set topology combined with NMF can decompose composite marker alleles into functionally interpretable regulatory haplotype subspaces.
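The "flip-flop" described in this abstract is a change in the sign of LD between the same two sites in different populations. The sketch below computes the standard signed correlation r = D / sqrt(pA(1-pA)pB(1-pB)), with D = pAB - pA·pB, from phased haplotypes. The haplotype counts are toy values invented for illustration, and the function name is hypothetical; this is not the carrier-set topology or NMF procedure of the preprint.

```python
from math import sqrt


def signed_r(haplotypes):
    """Signed LD correlation r between two biallelic sites.

    haplotypes: list of (a, b) tuples, alleles coded 0/1 at site A and site B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # signed disequilibrium coefficient D
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d / denom


# Toy phased haplotypes for two hypothetical populations.
pop1 = [(1, 1)] * 6 + [(0, 0)] * 3 + [(1, 0)] * 1  # alleles travel together
pop2 = [(1, 0)] * 5 + [(0, 1)] * 4 + [(1, 1)] * 1  # alleles on opposite backgrounds

r1, r2 = signed_r(pop1), signed_r(pop2)
print(f"population 1: r = {r1:+.3f}")
print(f"population 2: r = {r2:+.3f}")
print("flip-flop:", r1 * r2 < 0)
```

A tag SNP with positive r to a causal allele in one population and negative r in another would show opposite directions of association in the two populations even if the causal effect is identical.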
Dutta, S.
Genome-wide association studies have identified thousands of cancer risk variants in non-coding regions, yet their regulatory mechanisms remain largely uncharacterized. Here we present a regulatory annotation atlas of 6,983 genome-wide significant variants across 23 cancer types, scored using multimodal AlphaGenome predictions and integrated with ENCODE-4, Roadmap Epigenomics, and JASPAR 2024 annotations. Most variants (70.5%) fall outside annotated cis-regulatory elements; 27.7% overlap enhancers and 1.4% overlap promoters. Comparison with 6,626 position-matched eQTL control variants suggests that enhancer-classified variants carry 1.86-fold higher predicted effects (P = 1e-94) and promoter variants 7.84-fold (P = 2.5e-19). A composite prioritization score (RegVar-basic, excluding GWAS-derived pleiotropy and TF disruption, AUC = 0.650; RegVar-full, AUC = 0.675) outperforms CADD (0.499) and LINSIGHT (0.558) in this cancer-gene discrimination benchmark. Within-locus ranking across 2,626 GTEx DAP-G eQTL credible sets shows that RegVar identifies the highest-posterior-probability variant in 47.3% of loci (P = 7.0e-13), while CADD performs at chance. Predicted target genes show 67.7% concordance with GTEx eQTL assignments. Permutation-controlled motif analysis highlights NFKB1, STAT1, IRF1, and ARNT as exploratory permutation-enriched candidate transcription factors at cancer risk loci. This atlas provides a resource for interpreting non-coding cancer susceptibility variants. Because AlphaGenome uses expression-related training data, GTEx-based validations should be interpreted as partially orthogonal rather than fully independent.
Berk-Rauch, H. E.; Gherghina, L.-Y.; Huang, L.; Brand, A. H.; Chakravarti, A.
Autism spectrum disorder (ASD) exhibits a profound male-biased sex ratio. While numerous genes have been implicated in ASD, the functional basis of this sex difference is unclear. One enticing hypothesis is genome-wide transcriptional regulation through estrogens and androgens. While hormone-mediated transcription is well-studied in reproductive tissues, its role in cortical development is poorly defined. Thus, we profiled androgen (AR) and estrogen (ESR1/ESR2) receptor expression in mid-gestation human fetal (GW16-24) cortex and complementary cortical organoid models by single-cell RNA-seq. AR was primarily expressed in radial glia and intermediate progenitors, while ESR1/ESR2 were more broadly distributed across multiple cell types of the developing cortex, although with the highest expression in radial glia. To study their genetic effects, we exposed iNeurons and cortical organoids to physiological levels of dihydrotestosterone (DHT) and estradiol (E2). DHT consistently up-regulated oxidative metabolism programs enriched in progenitor cells and down-regulated neuronal maturation pathways, while E2 exhibited a much more attenuated effect. The presence of DHT reduced NTRK2 (TrkB) expression, correlating with expression in fetal cortex where NTRK2 had significantly higher expression in progenitor cells of the female cortex, which is also reflected in the increased expression of AR in radial glia. Together, these data indicate that in developing human cortical lineages, sex hormones act as selective, cell-state-dependent modulators that tune metabolic and maturation programs rather than broadly reprogramming the genome. Thus, the effects of variation in transcriptional regulation through estrogens and androgens are likely to be minor, but not absent, in ASD.
Gao, S.; Sui, Y.; Tian, P.; Rao, X.; Yan, C.; Xu, Y.; Wang, T.
Educational attainment-related polygenic scores have been implicated in autism spectrum disorder (ASD), but how parental polygenic scores shape offspring phenotypes remains unclear. Using genotyping and exome-sequencing data from 142,357 individuals (55,252 ASD cases) in a large ASD cohort, we dissected the direct and indirect genetic effects of educational attainment-related polygenic scores on ASD phenotypes. Trio-model analyses showed that parental polygenic scores for educational attainment (PGSEA) were associated with milder core ASD symptoms, including social deficits and repetitive behaviors, predominantly through indirect genetic effects, whereas their associations with comorbidities were driven predominantly by direct genetic effects. PGSEA was also significantly negatively associated with rare variant burden and prenatal factors, although these factors contributed largely independently to most phenotypes. Adjustment for full-scale intelligence quotient (FSIQ) and socioeconomic status (SES) partially attenuated the indirect effects of PGSEA on offspring phenotypes. Finally, higher parental PGSEA was associated with later age at diagnosis in offspring, partly through its protective effects on ASD phenotypes. These findings indicate that indirect genetic effects of parental PGSEA contribute substantially to phenotypic variation in ASD and highlight family-mediated pathways as an important component of ASD heterogeneity.
Diblasi, C.; Kwak, J. S.; Manousi, D.; Arnyasi, M.; de Leon, A. V.-P.; Barson, N. J.; Saitou, M.
Structural variants (SVs) are a major source of genomic diversity, yet the evolutionary origins of SVs shared across divergent populations remain difficult to resolve. Shared SVs may reflect ancient polymorphism, recurrent mutation, introgression, or subsequent lineage-specific frequency change, but the relative contribution of these processes often remains difficult to distinguish. Here, we investigated SV evolution across four Atlantic salmon (Salmo salar) lineages differing in geography, Europe versus North America, and domestication status, wild versus farmed. Using sensitive SV discovery, stringent genotyping, local PCA, haplotype-distance analyses, and forward simulations, we tested whether broadly shared SVs behave as a single class of variation or separate into distinct evolutionary categories. We generated a high-confidence SV map and found that SVs were enriched in repetitive regions, particularly segmental duplications and LTR retrotransposons, consistent with genome architecture shaping SV formation. Nearly half of high-confidence SVs were shared across all four lineages despite deep continental divergence, and simulations showed that this broad sharing is more consistent with ancient persistence than recurrent mutation alone. In contrast, a small subset of large SVs exhibited complex PCA clustering and multimodal haplotype-distance distributions, consistent with recurrent formation at structurally unstable loci. Large SVs also showed contrasting frequency trajectories between continents, and one immune gene-rich copy-number variable region showed a marked frequency increase in domesticated European salmon. Together, these results show that shared SVs comprise distinct evolutionary categories shaped by ancient persistence, recurrent mutation, and lineage-specific frequency change.
Cai, L.; DeBerardinis, R. J.
Heterozygous carriers of autosomal recessive disease variants are conventionally considered unaffected, yet population-scale genomic datasets reveal subclinical carrier phenotypes. MMACHC encodes a cobalamin-processing protein whose biallelic loss causes cobalamin C deficiency, an inborn error of intracellular cobalamin metabolism. We performed an unbiased quantitative phenome-wide association screen in All of Us Research Program v8 to identify phenotypes associated with rare heterozygous MMACHC burden variants. Serum/plasma vitamin B12 was the top quantitative association. Carriers had higher circulating B12 than non-carriers in adjusted analyses, but also higher homocysteine, suggesting that elevated circulating B12 does not reflect improved intracellular cobalamin function. Carriers were less likely to fall below conventional B12 insufficiency thresholds, indicating a potential diagnostic blind spot. A pathway-wide rare-variant gene-burden (All-by-All) analysis placed this finding in broader biological context. Burdens in genes related to circulating B12 binding or intestinal absorption were associated with lower circulating B12. In contrast, burdens in several genes involved in cellular delivery and intracellular cobalamin handling were associated with higher circulating B12. This step-specific directionality supports a model in which elevated circulating B12 can reflect impaired cellular handling and consequent systemic accumulation rather than improved cellular cobalamin availability. Because EHR-derived B12 is shaped by heterogeneous clinical and medication contexts, prospective carrier-enriched studies with standardized methylmalonic acid, homocysteine, diet, supplement, medication, comorbidity, and symptom ascertainment are needed to evaluate functional-marker-based screening.
Zhong, H.; Konciute, M. K.; Hu, J.; Menzies, J.; Cui, G.; Aranda, M.
Transposable elements (TEs) are pervasive components of eukaryotic genomes and major drivers of genome evolution, yet their contribution to cell-type-specific regulatory landscapes remains poorly understood, particularly in non-model marine invertebrates. Here, we integrated single-cell RNA sequencing with pseudo-aligned TE expression profiling to examine how TE transcription relates to cell type identity in the reef-building coral Acropora hemprichii. We constructed a cell atlas comprising 4,716 cells across eight major cell types. Notably, TE expression alone was sufficient to accurately resolve all major cell types, indicating that cell-type-specific transcriptional states are robustly reflected in TE activity patterns. We identified 9,759 expressed TEs, of which 333 exhibited strong cell-type-specific activity. These differentially expressed TE features were associated with nearby expressed genes and transcription factor loci, suggesting a relationship between cell-type-specific TE activity and local gene regulatory programs. Genes associated with cell-type-specific TEs were enriched for core coral physiological processes, including calcification, metabolite transport, and symbiosis-related functions. Together, these findings indicate that TE transcription is structured along coral cell-type identity and physiological specialization. Our study provides a single-cell-resolved framework for investigating TE-gene relationships in early-diverging metazoans and a community resource for future functional interrogation in reef-building corals.
Lee, H.; Sun, H.; Cao, X.; Karaahmet, B.; Li, Z.; Klein, H.-U.; Taga, M.; Wang, G.; De Jager, P. L.; Bennett, D. A.; Pinello, L.; Jin, X.; Mazumder, R.; Dey, K. K.
Spatial gene expression patterns underlie tissue organization, development, and disease, yet current methods for detecting spatially variable genes (SVGs) lack the flexibility to capture multi-scale structure, ensure robustness across platforms, and integrate with genetic data to assess disease relevance. We present Spacelink, a unified framework that models spatial variability of a gene at both whole-tissue and cell-type resolution using an adaptive mixture of data-driven spatial kernels and summarizes it using an Effective Spatial Variability (ESV) metric. Spacelink achieved up to 3.2x higher detection power over eight existing global SVG and cell-type SVG methods while showing consistently superior FDR control across 34 different simulation settings and also showed superior cross-platform concordance in matched tissue Visium and CosMx datasets. Applied to 3 healthy CosMx human tissues (brain cortex, lymph node, liver), Spacelink revealed that SVGs are highly informative for 113 complex traits and diseases (average GWAS sample size = 340,406). Spacelink showed up to 2.2x higher disease informativeness over competing methods in tissue-relevant complex diseases and traits, conditional on putative non-spatial expression-level confounders. Applied to a mouse organogenesis Stereo-seq atlas (8 developmental stages), Spacelink identified 145 genes with stage-associated ESV within brain independent of mean expression, that are enriched in pathways like Wnt signaling and Rap1 signaling characterizing early and late development, respectively. Integration with in vivo Perturb-seq targeting 35 de novo ASD risk genes revealed that perturbations in excitatory neurons and astrocytes preferentially altered spatially structured downstream gene programs (1.7-2.2x higher average ESV across stages than other cell types), many of which were enriched for polygenic autism GWAS loci. 
In neurodegeneration, analysis of 32 Visium dorsolateral prefrontal cortex samples spanning Alzheimer's disease (AD) pathology stages identified 334 genes with decreasing ESV along amyloid burden (enriched for glycolysis) and 216 genes with decreasing ESV along tau tangle accumulation (enriched for apoptotic pathways). Several AD risk genes (PKM, CLU, GPI) showed conserved reductions in spatial variability with AD pathology in both human and 5xFAD mouse, with PKM linking to a colocalized splicing QTL and amyloid burden QTL variant. These results highlight the utility of Spacelink in decoding spatially variable gene programs that connect tissue architecture to disease genetics.
Leboine, C.; del Rio-Hortega, L.; Henry, N.; Zallio, M.; Bonacolta, A. M.; Belser, C.; Aury, J.-M.; Voolstra, C. R.; Hume, B. C.; Moussy, A.; Moulin, C.; Boissin, E.; Bourdin, G.; Iwankow, G.; Poulain, J.; Romac, S.; Tara Pacific Consortium coordinators; del Campo, J.; Allemand, D.; Planes, S.; Ziegler, M.; Wincker, P.; Carradec, Q.; Porcel, B. M.
Parasitism is one of the most widespread trophic strategies in nature, though its diversity and ecological distribution in marine ecosystems remain poorly characterized. Apicomplexa are a major clade of obligate parasites best known for medically important taxa, yet their diversity and distribution in the ocean are still largely unresolved. Here, we used metabarcoding data from the Tara expeditions to investigate the diversity, distribution, and environmental drivers of Apicomplexa across coral reef ecosystems and adjacent oceanic habitats. By integrating samples spanning planktonic communities, coral tissues, and marine sediments across multiple oceanic regions, we substantially expand the known phylogenetic breadth of marine apicomplexans. Although apicomplexans were generally low in relative abundance, they were widely distributed across marine environments. Community composition differed markedly among habitats. Corallicolid lineages were consistently associated with coral hosts, whereas planktonic samples harbored a greater diversity of apicomplexans, dominated by crustacean-associated gregarines. Sediments contained particularly high apicomplexan richness, including several poorly characterized groups. Capitalizing on the pan-Pacific transect of the expedition, we resolved biogeographic patterns in apicomplexan diversity across ocean basins: tropical regions showed the highest overall diversity, while polar environments contained distinct apicomplexan assemblages not detected in other ocean biomes. Together, these results highlight the extensive and previously underappreciated diversity of marine Apicomplexa and demonstrate that integrating multiple marine biomes is essential for resolving the phylogenetic and ecological breadth of parasitism in the ocean.
Duan, J.; Li, B.; Kulkarni, K.; Orquera-Tornakian, G.; Barth, D.; Wang, L.; Pandit, V.; Liou, J.; Munshi, N. V.; Hon, G. C.
Transcription factors (TFs) cooperatively drive gene regulatory networks (GRNs) to establish transcriptional states. Forced induction of TFs in combination can reprogram cell state by supplanting existing GRNs. Thus, TFs and GRNs are the building blocks to engineering transcriptional state. However, one key challenge is that the relationship between TF combinations and GRNs remains largely uncharacterized and difficult to accurately predict. Here, we apply single-cell overexpression screens to map the combinatorial activities of ~100 TFs to gene expression states. Our analysis identifies diverse TF combinations driving cell-type specific regulatory programs. Notably, different TF combinations induce shared gene sets with cell-type specific functions, suggesting a modular regulatory architecture of the transcriptome. Furthermore, we define pairwise TF interactions and show that cooperative interactions improve transcriptional reprogramming. Finally, we developed tools to predict combinatorial TF phenotypes. These findings improve our understanding of cell state and how to manipulate it for biomedical applications.
Highlights: Combinatorial over-expression screens for ~100 transcription factors (TFs). Diverse TF combinations drive cell-type specific regulatory programs. TF regulatory networks reveal a modular regulatory architecture of the transcriptome. TF-TF interactions and predictive models enhance reprogramming cocktails.
Kwon, S.; Safer, J.; DiStefano, M.; Lebo, M.; Rehm, H. L.; Iqbal, S.
Missense variant interpretation remains a central challenge in clinical and medical genetics, with most observed variants being variants of uncertain significance (VUS). Computational variant effect predictors can achieve high pathogenicity classification performance, but without revealing the underlying mechanism and a translatable interpretation. Here we present the Protein Feature Enrichment Score (PFES), which quantifies the molecular context of missense variants through statistical enrichment of 103 protein structural, functional, and physicochemical features across 85,321 pathogenic and 130,719 control variants spanning 20 protein functional classes. We show that the protein feature (PF) enrichment patterns of variants are conserved within functional classes and vary substantially across classes, both in magnitude and directions depending on functional context. PFES not only partitions variants into PF-Enriched (pathogenic-like), PF-Neutral, and PF-Depleted (benign-like) categories but also provides a mechanistic interpretation by decomposing the score into subscores from biologically interpretable protein feature attributes. We demonstrate that PFES shows a high concordance with VUS reclassification and prioritization: across 596 genes, pathogenicity-leaning VUS-high variants were seven-fold enriched in PF-Enriched variants. PFES decomposition further revealed that loss-of-function and gain-of-function variants are distinguished by disproportionate enrichment of protein-protein interaction features in the latter. We computed PFES across 223 million possible missense variants (17.7% PF-Enriched) and built a publicly available resource that addresses not just whether a variant is pathogenic, but which protein characteristics are disrupted. 
Proteome-wide application across 20,153 genes prioritizes established rare disease genes and nominates therapeutically amenable targets whose pathogenic variation is driven by interpretable structural and functional protein feature disruption. One Sentence Summary: PFES is a proteome-wide resource to quantify the protein context of missense variants, enabling mechanistically transparent variant interpretation.
Palmer, D. S.; Hill, B.; Hodgson, S.; Joeloo, M.; Kalantzis, G.; Kousathanas, A.; Koyama, S.; Lu, W.; Namba, S.; Rodriguez, Z. B.; Shortt, J. A.; Sonehara, K.; Vartanian, N.; Vy, H. M. T.; Wade, I. A.; White, S. L.; Baya, N. A.; Chami, N.; Do, R.; Estrada, K.; Finer, S.; Genovese, G.; Guez, J.; Itan, Y.; Kanai, M.; Lassen, F. H.; Matsuda, K.; Moutsianas, L.; Peloso, G. M.; Priit, P.; Rader, D. J.; Rendon, A.; Rocheleau, G.; Sadeghi-Alavijeh, O.; Selvaraj, M. S.; Smit, R. A.; Wang, D.; Wigdor, E. M.; Yu, Z.; Colorado Center for Personalized Medicine; Estonian Biobank Research Team; Genes
Rare coding variants can have large effects on disease risk and provide direct routes from human genetics to disease mechanisms and therapeutic targets, but their discovery is constrained by sample size, particularly for low-prevalence diseases. Here we establish the Biobank Rare Variant Analysis (BRaVa) consortium, a global rare variant association resource that integrates sequencing and linked health-record data from ten biobanks and cohorts comprising over 1.2 million individuals across diverse ancestries. We performed gene-based meta-analyses of rare coding variation across 33 clinical endpoints and 11 quantitative traits. Aggregating evidence across biobanks and ancestries identified 514 gene-trait associations, including 31 that a systematic literature review did not find in prior studies or curated association resources. Notably, 36.1% of gene-level associations were undetectable in any individual biobank, and 91 emerged only through cross-ancestry meta-analysis, demonstrating that federated integration enables discovery beyond the reach of single cohorts. Similar gains were observed at the variant level, where 25.0% of phenotype-locus associations were detectable only through meta-analysis. Effect size estimates were correlated across ancestries with concordant directions of effect, supporting the generalizability of rare variant associations. The identified signals implicate pathways involved in transcriptional and epigenetic regulation, metabolism, vascular and epithelial biology, and immune function, highlighting rare coding variation as an engine for biological discovery across medical record phenotypes. For example, damaging variation in ANKRD12 implicates inflammatory transcriptional dysregulation in asthma and chronic obstructive pulmonary disease, and ultra-rare predicted loss-of-function variants in NAA15 link protein acetylation processes to type 2 diabetes risk.
BRaVa establishes a scalable framework and freely available community resource for rare variant meta-analysis across global biobanks. Public release of gene- and variant-level association summary statistics provides a reference map of rare coding variant associations to support disease gene discovery, biological interpretation, and therapeutic target prioritization as sequencing-linked health-record resources continue to expand.
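Why an association can be undetectable in every individual biobank yet significant after meta-analysis can be illustrated with a generic fixed-effect, inverse-variance-weighted (IVW) combination. This is a textbook sketch only: the abstract does not specify which meta-analysis method BRaVa used, and the effect sizes, standard errors, and significance threshold below are hypothetical values chosen for illustration.

```python
from math import erfc, sqrt


def two_sided_p(beta, se):
    """Two-sided p-value for a normally distributed estimate beta with standard error se."""
    return erfc(abs(beta / se) / sqrt(2))


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    Returns the pooled effect, its standard error, and a two-sided p-value.
    """
    weights = [1.0 / se**2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se, two_sided_p(pooled_beta, pooled_se)


# Hypothetical per-cohort estimates for one gene-trait pair.
betas = [0.5, 0.5, 0.5]
ses = [0.2, 0.2, 0.2]
alpha = 1e-4  # illustrative threshold, not the one used by the consortium

cohort_ps = [two_sided_p(b, s) for b, s in zip(betas, ses)]
pooled_beta, pooled_se, pooled_p = ivw_meta(betas, ses)

print("per-cohort p-values:", [f"{p:.3g}" for p in cohort_ps])
print(f"pooled beta = {pooled_beta:.3f}, se = {pooled_se:.3f}, p = {pooled_p:.3g}")
```

With three equally sized cohorts the pooled standard error shrinks by a factor of sqrt(3), so an effect that falls short of the threshold in each cohort can clear it once the evidence is combined.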
Zabala, A.; Ascension, A. M.; Iniguez, S. G.; Iparraguirre, L.; Andres-Leon, E.; Matesanz, F.; Otaegui, D.; Munoz-Culla, M.
Introduction: Circular RNA quantitative trait loci (circQTLs) have emerged as a class of regulatory variants, but their mechanistic basis remains poorly characterized. Understanding how genetic variation influences circRNA biogenesis is essential to clarify their role in post-transcriptional gene regulation. Methods: We systematically compared circQTLs with matched splicing (sQTL) and expression (eQTL) datasets. Using bootstrap-based Jaccard similarity analyses, we quantified genomic overlap patterns and assessed their statistical significance. We further validated these findings across independent circQTL studies. In addition, we analyzed the genomic distribution of circQTLs to identify enrichment patterns across functional genomic regions. Results: circQTLs exhibited a statistically significant but modestly stronger genomic overlap with sQTLs compared to eQTLs. This pattern was consistent across independent datasets despite limited reproducibility of individual circQTL signals. Genomic annotation revealed distinct distributional patterns, including depletion in exonic regions and relative enrichment in non-coding genomic contexts compared to other QTL classes. Discussion: Together, these results suggest that circRNA-associated regulatory variation is preferentially linked to splicing-related mechanisms rather than transcriptional control of host genes. However, the modest effect size indicates that this relationship is not exclusive, and likely reflects a mixture of shared splice-site regulatory effects and additional mechanisms specific to back-splicing that are not captured by conventional sQTL or eQTL frameworks. This dual architecture positions circRNA biogenesis at the interface between splicing dynamics, RNA structure, and higher-order genomic organization, supporting circQTLs as a distinct layer of post-transcriptional gene regulation.
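The bootstrap-based Jaccard comparison named in the Methods can be illustrated with a minimal sketch. The abstract does not describe the actual resampling scheme, so everything here is an assumption: QTLs are represented as hypothetical hashable identifiers (for example gene or variant IDs), the query set is resampled with replacement, and a percentile interval is reported. The function names `jaccard` and `bootstrap_jaccard` are illustrative, not taken from the preprint.

```python
import random


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_jaccard(query, reference, n_boot=1000, seed=0):
    """Point estimate plus a 95% percentile bootstrap interval.

    Assumption: only the query set (e.g. circQTL IDs) is resampled with
    replacement; the reference set (e.g. sQTL or eQTL IDs) is held fixed.
    """
    rng = random.Random(seed)
    query = list(query)
    reference = set(reference)
    stats = []
    for _ in range(n_boot):
        # Resample the query IDs with replacement, same size as the original.
        resample = [rng.choice(query) for _ in query]
        stats.append(jaccard(resample, reference))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return jaccard(query, reference), (lower, upper)
```

Under these assumptions, calling `bootstrap_jaccard` once against sQTL identifiers and once against eQTL identifiers, then comparing the two intervals, would give a rough analogue of the overlap comparison described; the preprint's own significance test may differ.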
Rafi, A. M.; Eraslan, G.; Fletez-Brant, K.
Sequence-to-function (S2F) models predict molecular phenotypes from DNA sequence and are increasingly applied to variant effect prediction (VEP), where the goal is to quantify how genetic variants alter gene expression. However, S2F model predictions are not uniformly reliable: accuracy varies substantially across variants, genes, and tissues, and current practice relies on crude magnitude thresholding to enrich for trustworthy predictions, which discards the majority of variants where S2F models could still provide signal. We developed gRely, a meta-modeling framework that estimates the probability that a given Borzoi VEP correctly predicts eQTL direction, using 1,121 features derived from the target variant, gene, and model outputs. On held-out tissues, gRely achieves a mean average precision of 0.885 (random baseline 0.744). Critically, within the low-magnitude regime where thresholding fails entirely, gRely identifies a high-confidence subset with 76% accuracy compared to a 58% baseline, recovering reliable predictions that magnitude filtering would discard. Interpretation via SHAP reveals that in this low-magnitude regime, gene expression level and cross-replicate signal concentration replace VEP magnitude as the primary discriminators of reliability. gRely is the first framework to provide per-prediction confidence scores for S2F model VEPs, and generalizes across architectures, producing consistent improvements on AlphaGenome predictions. By making reliability quantifiable, gRely enables principled filtering rather than blanket thresholding, and marks a step toward trustworthy deployment of S2F models in genomic research and clinical applications.
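To make the reported metric concrete, the following is a minimal sketch of average precision for one tissue, with labels marking whether a prediction got the eQTL direction right and scores standing in for a reliability estimate. The function names and toy inputs are hypothetical. A random ranking has expected average precision close to the fraction of positives, which is presumably what the quoted random baseline reflects; that reading is an assumption, not something the abstract states.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    labels: iterable of 0/1 (1 = direction predicted correctly, hypothetical).
    scores: iterable of confidence scores, higher = more trusted.
    """
    labels = list(labels)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


def positive_fraction(labels):
    """Fraction of positives, the approximate average precision of a random ranking."""
    labels = list(labels)
    return sum(labels) / len(labels)
```

A mean average precision like the one quoted would then be the mean of `average_precision` over held-out tissues, compared against the mean of `positive_fraction` over the same tissues.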
Flowers, B.; Lialios, P.; DiLollo, I.; Smith, N.; Whalley, J.; Lee, J.-S.
Across gastrointestinal (GI) cancers, shared malignant programs are layered onto strong anatomical, lineage, and microenvironmental variation, making it difficult to distinguish disease-relevant long noncoding RNAs (lncRNAs) from context-dependent transcriptional signals. We developed a pan-GI integrative framework to classify lncRNAs across colorectal adenocarcinoma, gastric adenocarcinoma, and esophageal cancer using bulk and single-cell transcriptomic resources. This framework evaluates lncRNAs across four complementary dimensions: recurrent tumor-associated expression, clinical association with disease progression and overall survival, co-expression network context, and malignant epithelial expression at single-cell resolution. Paired tumor-normal RNA-seq analyses identified extensive tumor-associated lncRNA dysregulation and defined recurrent pan-GI lncRNAs consistently upregulated across cancer types. Clinical analyses further nominated transcripts linked to tumor extension, nodal involvement, metastatic dissemination, progression-linked expression, and adverse overall survival. Co-expression network analysis identified lncRNAs embedded within disease-associated transcriptional modules, providing functional context for otherwise poorly annotated transcripts. In parallel, single-cell-derived metacell analysis nominated malignant epithelial-associated and detection-supported lncRNAs, helping distinguish tumor-compartment-associated signals from stromal, immune, endothelial, and other microenvironmental contributions. Together, this study establishes an evidence-structured pan-GI lncRNA resource and a generalizable prioritization strategy for nominating disease-associated noncoding transcripts. More broadly, the framework provides a transferable strategy for systematic lncRNA prioritization across other cancers and heterogeneous disease contexts.
E. Camarena, M.; Vara, C.; Papadopoulos, C.; Montanes, J. C.; Razquin-Sola, S.; Taillandier-Coindard, M.; Pak, H.; Müller, M.; Khelgati, N.; Garcia-Soriano, J. C.; Fortes, P.; Bassani-Sternberg, M.; Perera-Bel, J.; Alba, M. M.
Classical cancer-germline antigens (CGAs) are proteins that are expressed in the male germline but not in somatic tissues, and that can also become expressed in tumors. However, the vast majority of testis-specific transcripts are long non-coding RNAs (lncRNAs) rather than protein-coding genes. Since recent studies have shown that many lncRNAs contain non-canonical open reading frames (ncORFs) that are translated into small proteins, or microproteins, there could be a large class of non-canonical cancer-germline antigens (ncCGAs) that remains to be discovered. Here, we integrate ribosome profiling from human testis and cancer cell lines with paired tumor/normal transcriptomes from 917 patients across eight common cancer types to define a comprehensive catalog of ncCGAs. This set comprises 235 ncCGAs encoded by lncRNAs or mRNA untranslated regions (5′ UTRs and 3′ UTRs), compared to 192 canonical CGAs (cCGAs) with similar expression patterns. We show that ncCGAs are evolutionarily young, consistent with recent de novo emergence in the rapidly evolving male germline. Moreover, a large fraction is expressed across multiple patients and cancer types, indicating recurrent reactivation mechanisms in tumors. We further find that ncCGAs are frequently located in cancer-amplified regions or associated with MYC- or E2F-regulated pathways, which may explain their expression in cancer. Finally, we provide strong evidence that a subset of ncCGAs gives rise to potentially immunogenic HLA class I-bound peptides. Together, our results describe a previously unexplored class of tumor-restricted antigens with potential applications in cancer immunotherapy.